Clemson HGIC Home & Garden Factsheet Scraper

Pricing

Pay per event

Developer

BowTiedRaccoon

Maintained by Community

Scrapes the Clemson Home & Garden Information Center (HGIC) factsheet library — 2,500+ science-based factsheets covering plant care, diseases, pest management, lawn care, and food preservation. Outputs structured records with HGIC ID, body sections, symptoms, causal agent, management recommendations, recommended products, authors, and related factsheets.

What It Does

Clemson HGIC is one of the largest university extension factsheet libraries in the US, with 2,500+ documents focused on the southeastern US plant palette. Each factsheet follows a consistent template with discrete sections: symptoms, causal agent (pathogen or pest binomial), management and control recommendations, and prevention. This actor parses that structure into machine-readable fields, the kind of grounding data that plant-diagnosis apps, AI garden assistants, and agronomy SaaS platforms need.

The actor reads the Yoast sitemap index to enumerate all factsheet URLs, then crawls each page using impit with Chrome TLS fingerprinting, so no proxy or CAPTCHA solver is required.

Use Cases

  • Training data for plant disease diagnosis AI and AI garden assistant models
  • Structured extension knowledge base for horticulture SaaS
  • Agronomy/landscaping content and reference data pipelines
  • Garden app content enrichment (symptom/treatment lookup)
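
As a rough illustration of the symptom/treatment lookup use case, the sketch below filters scraped records by a keyword in the symptoms field. The function name find_by_symptom and the two records are made-up placeholders, not real HGIC content; only the field names (title, problem_type, symptoms, management) come from the output schema.

```python
def find_by_symptom(records, keyword):
    """Return records whose 'symptoms' text contains the keyword (case-insensitive)."""
    needle = keyword.casefold()
    return [
        r for r in records
        if needle in (r.get("symptoms") or "").casefold()
    ]


# Placeholder records for illustration only (not real factsheet data).
records = [
    {
        "title": "Example Leaf Spot Factsheet",
        "problem_type": "disease",
        "symptoms": "Small brown spots appear on lower leaves.",
        "management": "Remove affected leaves.",
    },
    {
        "title": "Example Lawn Care Factsheet",
        "problem_type": "none",
        "symptoms": "",
        "management": "",
    },
]

matches = find_by_symptom(records, "Brown Spots")
for r in matches:
    print(r["title"], "->", r["management"])
```

A production lookup would likely need stemming or embedding search rather than substring matching; this only shows how the flat string fields can be queried directly.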

Input

| Field | Type | Default | Description |
|---|---|---|---|
| maxItems | integer | 10 | Maximum number of factsheets to scrape. Set to a large number to scrape all 2,500+ factsheets. |

Output

Each item represents one HGIC factsheet.

| Field | Type | Description |
|---|---|---|
| factsheet_id | string | HGIC factsheet number, e.g. HGIC 1223 |
| slug | string | URL slug, e.g. turfgrasses-for-the-carolinas |
| title | string | Factsheet title |
| category | string | Subject category: Diseases, Insects, Lawns, Soils, Vegetables, Trees & Shrubs, Flowers, Fruits & Nuts, Food Safety & Preservation, Human Health & Safety, General |
| plant_subjects | string | Comma-separated plant names from the title |
| problem_type | string | Problem type: disease, insect, cultural, or none |
| summary | string | First meaningful paragraph / introductory text |
| body_sections | string | JSON array of {heading, text} objects for the full structured body |
| symptoms | string | Symptom description text (for disease/pest/damage factsheets) |
| causal_agent | string | Pathogen or pest scientific/common name |
| management | string | Management and control recommendation text |
| prevention | string | Prevention and cultural practices text |
| recommended_products | string | Comma-separated trade names and chemistries found in management sections |
| related_factsheets | string | Comma-separated related factsheet links (`title |
| last_updated | string | Revision date as shown in factsheet metadata, e.g. Feb 28, 2016 |
| authors | string | Comma-separated list of factsheet authors |
| images | string | Comma-separated image URLs embedded in the factsheet |
| factsheet_url | string | Canonical URL of the factsheet |
| scrapedAt | string | ISO-8601 timestamp when the record was scraped |

Sample Output

{
  "factsheet_id": "HGIC 1223",
  "slug": "turfgrasses-for-the-carolinas",
  "title": "Turfgrasses for the Carolinas",
  "category": "Lawns",
  "problem_type": "none",
  "summary": "For over 50 years the lawn has been an integral part of the landscape...",
  "body_sections": "[{\"heading\":\"Mowing\",\"text\":\"...\"}]",
  "last_updated": "Feb 28, 2016",
  "authors": "Millie Davenport, Gary Forrester",
  "factsheet_url": "https://hgic.clemson.edu/factsheet/turfgrasses-for-the-carolinas/"
}
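
Every output field is a string, so body_sections arrives as JSON-encoded text and list-like fields arrive comma-separated. The sketch below shows one way a consumer might decode a record. The helper name decode_record and the choice of which fields to split are assumptions. images and related_factsheets are left untouched here because URLs and titles may themselves contain commas.

```python
import json

# Fields assumed safe to split on commas (an assumption, not part of the actor).
LIST_FIELDS = ("plant_subjects", "recommended_products", "authors")


def decode_record(record: dict) -> dict:
    """Return a copy with body_sections parsed and list fields split."""
    out = dict(record)
    # body_sections is a JSON array serialized as a string.
    out["body_sections"] = json.loads(record.get("body_sections") or "[]")
    for field in LIST_FIELDS:
        raw = record.get(field) or ""
        out[field] = [part.strip() for part in raw.split(",") if part.strip()]
    return out


# Trimmed copy of the sample record shown above.
sample = {
    "factsheet_id": "HGIC 1223",
    "title": "Turfgrasses for the Carolinas",
    "body_sections": '[{"heading":"Mowing","text":"..."}]',
    "authors": "Millie Davenport, Gary Forrester",
}

decoded = decode_record(sample)
print(decoded["authors"])                       # ['Millie Davenport', 'Gary Forrester']
print(decoded["body_sections"][0]["heading"])   # Mowing
```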

Discovery Method

Reads the Yoast sitemap index at https://hgic.clemson.edu/sitemap.xml, filters for factsheet-sitemap.xml and factsheet-sitemap2.xml, and collects all /factsheet/<slug>/ URLs. The maxItems cap is applied before crawling begins.
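
The discovery step described above can be sketched roughly as follows. This is not the actor's actual code: the function names, the injected fetch callable, and the inline sample XML (including post-sitemap.xml and the example-factsheet slug) are illustrative assumptions. Only the sitemap filenames and the /factsheet/<slug>/ URL pattern come from the description.

```python
import re
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
WANTED = {"factsheet-sitemap.xml", "factsheet-sitemap2.xml"}
FACTSHEET_URL = re.compile(r"^https://hgic\.clemson\.edu/factsheet/[^/]+/$")


def factsheet_sitemaps(index_xml: str) -> list:
    """Pick the factsheet child sitemaps out of a sitemap index."""
    root = ET.fromstring(index_xml)
    locs = [el.text.strip() for el in root.findall("sm:sitemap/sm:loc", NS) if el.text]
    return [u for u in locs if u.rsplit("/", 1)[-1] in WANTED]


def factsheet_urls(sitemap_xml: str) -> list:
    """Keep only /factsheet/<slug>/ URLs from one child sitemap."""
    root = ET.fromstring(sitemap_xml)
    locs = [el.text.strip() for el in root.findall("sm:url/sm:loc", NS) if el.text]
    return [u for u in locs if FACTSHEET_URL.match(u)]


def collect_urls(index_xml: str, fetch, max_items: int) -> list:
    """Enumerate all factsheet URLs, then apply the cap before any crawling."""
    urls = []
    for sitemap_url in factsheet_sitemaps(index_xml):
        urls.extend(factsheet_urls(fetch(sitemap_url)))
    return urls[:max_items]


# Made-up sample documents standing in for real HTTP responses.
BASE = "https://hgic.clemson.edu"
XMLNS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
INDEX_XML = (
    f"<sitemapindex {XMLNS}>"
    f"<sitemap><loc>{BASE}/post-sitemap.xml</loc></sitemap>"
    f"<sitemap><loc>{BASE}/factsheet-sitemap.xml</loc></sitemap>"
    f"<sitemap><loc>{BASE}/factsheet-sitemap2.xml</loc></sitemap>"
    "</sitemapindex>"
)
PAGES = {
    f"{BASE}/factsheet-sitemap.xml": (
        f"<urlset {XMLNS}>"
        f"<url><loc>{BASE}/factsheet/</loc></url>"
        f"<url><loc>{BASE}/factsheet/turfgrasses-for-the-carolinas/</loc></url>"
        "</urlset>"
    ),
    f"{BASE}/factsheet-sitemap2.xml": (
        f"<urlset {XMLNS}>"
        f"<url><loc>{BASE}/factsheet/example-factsheet/</loc></url>"
        "</urlset>"
    ),
}

print(collect_urls(INDEX_XML, PAGES.__getitem__, max_items=10))
```

Passing fetch in as a callable keeps the parsing logic testable offline. In a real run it would be replaced by an HTTP client call.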

Performance

  • Memory: 128–256 MB
  • Throughput: ~200 pages/minute at default concurrency (5)
  • Full corpus (~2,500 factsheets): ~15–20 minutes
  • Timeout: 2-hour default (sufficient for full corpus)
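
A back-of-envelope check of these figures: at roughly 200 pages/minute, fetching about 2,500 pages would take around 12.5 minutes on its own. The quoted 15–20 minutes therefore presumably includes startup, sitemap discovery, and retries (an assumption; the figures above are not broken down further). The helper name crawl_minutes is illustrative.

```python
def crawl_minutes(pages: int, pages_per_minute: float) -> float:
    """Estimate pure crawl time in minutes from page count and throughput."""
    if pages_per_minute <= 0:
        raise ValueError("pages_per_minute must be positive")
    return pages / pages_per_minute


print(crawl_minutes(2500, 200))  # 12.5

# Effective throughput implied by the quoted 15-20 minute range.
for minutes in (15, 20):
    print(minutes, "min ->", round(2500 / minutes), "pages/minute")
```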